
    Pb-Hash: Partitioned b-bit Hashing

    Many hashing algorithms, including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS), generate integers of B bits. With k hashes for each data vector, the storage would be B × k bits; and when used for large-scale learning, the model size would be 2^B × k, which can be expensive. A standard strategy is to use only the lowest b bits out of the B bits and somewhat increase k, the number of hashes. In this study, we propose to re-use the hashes by partitioning the B bits into m chunks, e.g., b × m = B. Correspondingly, the model size becomes m × 2^b × k, which can be substantially smaller than the original 2^B × k. Our theoretical analysis reveals that by partitioning the hash values into m chunks, the accuracy would drop. In other words, using m chunks of B/m bits would not be as accurate as directly using B bits, due to the correlation introduced by re-using the same hash. On the other hand, our analysis also shows that the accuracy does not drop much for (e.g.) m = 2~4, and in some regimes Pb-Hash still works well even for m much larger than 4. We expect Pb-Hash to be a good addition to the family of hashing methods/applications and to benefit industrial practitioners. We verify the effectiveness of Pb-Hash in machine learning tasks, for linear SVM models as well as deep learning models. Since the hashed data are essentially categorical (ID) features, we follow the standard practice of using embedding tables for each hash. With Pb-Hash, we need an effective strategy to combine the m embeddings. Our study provides an empirical evaluation of four pooling schemes: concatenation, max pooling, mean pooling, and product pooling. There is no definite answer as to which pooling is always better, and we leave that for future study.
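    The chunking step above can be sketched in a few lines. This is an illustrative toy (function and parameter names are ours, not the paper's), assuming B = 16-bit hash values partitioned into m = 4 chunks of b = 4 bits:

    ```python
    def partition_hash(h, B=16, m=4):
        """Split one B-bit hash value into m chunks of b = B // m bits each
        (low-order chunk first). A hypothetical sketch of the Pb-Hash idea."""
        b = B // m
        mask = (1 << b) - 1
        return [(h >> (i * b)) & mask for i in range(m)]

    # Each chunk indexes its own 2^b-entry embedding table, so the model
    # size per hash drops from 2**16 entries to 4 * 2**4 entries.
    chunks = partition_hash(0xBEEF)
    print(chunks)  # low-to-high 4-bit chunks of 0xBEEF: [15, 14, 14, 11]
    ```

    The m small per-chunk embeddings would then be combined by one of the pooling schemes the paper evaluates (concatenation, max, mean, or product).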

    Constrained Approximate Similarity Search on Proximity Graph

    Search engines and recommendation systems are built to efficiently display relevant information from massive pools of candidates. Typically a three-stage mechanism is employed in those systems: (i) a small collection of items is first retrieved by (e.g.) approximate near neighbor search algorithms; (ii) then a collection of constraints is applied to the retrieved items; (iii) a fine-grained ranking neural network determines the final recommendation. We observe a major defect of this three-stage pipeline: although we only need to retrieve k vectors for the final recommendation, we have to preset a sufficiently large s (s > k) for each query and "hope" that the number of vectors surviving the filtering is not smaller than k, i.e., that at least k of the s similar candidates satisfy the query constraints. In this paper, we investigate this constrained similarity search problem and merge the similarity search stage and the filtering stage into one single search operation. We introduce AIRSHIP, a system that integrates user-defined function filtering into the similarity search framework. The proposed system needs neither extra indices nor prior knowledge of the query constraints. We propose three optimization strategies: (1) starting point selection, (2) multi-direction search, and (3) biased priority queue selection. Experimental evaluations on both synthetic and real data confirm the effectiveness of the proposed AIRSHIP algorithm. We focus on constrained graph-based approximate near neighbor (ANN) search in this study, in part because graph-based ANN is known to achieve excellent performance. We believe it is also possible to develop constrained hashing-based ANN or constrained quantization-based ANN.
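    The core idea of folding the filter into the graph traversal can be sketched as a best-first search that checks the user-defined predicate as nodes are popped, stopping once k filtered answers are found. This is a minimal sketch of the general technique, not the AIRSHIP implementation (which adds starting-point selection, multi-direction search, and a biased priority queue):

    ```python
    import heapq

    def constrained_greedy_search(graph, dist, passes_filter, start, k):
        """Best-first search on a proximity graph (adjacency-list dict) that
        applies a user-defined filter during traversal, so search and
        filtering happen in a single pass. Illustrative sketch only."""
        visited = {start}
        frontier = [(dist(start), start)]   # min-heap ordered by distance to query
        results = []                        # filtered answers found so far
        while frontier and len(results) < k:
            d, node = heapq.heappop(frontier)
            if passes_filter(node):         # user-defined constraint check
                results.append((d, node))
            for nb in graph[node]:          # expand graph neighbors
                if nb not in visited:
                    visited.add(nb)
                    heapq.heappush(frontier, (dist(nb), nb))
        return [n for _, n in results]
    ```

    Unlike the search-then-filter pipeline, there is no need to guess an oversized s: the traversal simply continues until k constraint-satisfying neighbors are collected or the frontier is exhausted.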

    Anti-Helicobacter pylori activity of steroidal alkaloids obtained from three Veratrum plants

    Anti-Helicobacter pylori (HP) activities were examined, by the disc method, for three total alkaloid fractions and fourteen steroidal alkaloids obtained from three Veratrum plants (V. maackii, V. nigrum var. ussuriense, and V. patulum), which are used under the name "Li-lu (藜蘆)" to treat aphasia arising from apoplexy, wind-type dysentery, jaundice, headache, scabies, chronic malaria, etc. Among them, verapatulin (12) and veratramine (13) showed anti-HP activity. The disc minimum inhibitory concentration (disc-MIC) value of 12 against two standard HP strains, NCTC11637 and NCTC11916, was 10 μg/ml, which is much higher than that of the clinically used antibiotic erythromycin (≤0.013 μg/ml), i.e., weaker activity, but comparable to those of penicillin G (3.1 μg/ml and 1.6 μg/ml, respectively).

    Asymmetric Hashing for Fast Ranking via Neural Network Measures

    Fast item ranking is an important task in recommender systems. In previous works, graph-based Approximate Nearest Neighbor (ANN) approaches have demonstrated good performance on item ranking tasks with generic searching/matching measures (including complex measures such as neural network measures). However, since these ANN approaches must evaluate the neural measure many times during ranking, the computation is not practical if the neural measure is a large network. On the other hand, fast item ranking using existing hashing-based approaches, such as Locality Sensitive Hashing (LSH), only works with a limited set of measures. Previous learning-to-hash approaches are also not suitable for the fast item ranking problem, since they can take a significant amount of time and computation to train the hash functions. Hashing approaches are nevertheless attractive because they provide a principled and efficient way to retrieve candidate items. In this paper, we propose a simple and effective learning-to-hash approach for the fast item ranking problem that can be used with any type of measure, including neural network measures. Specifically, we solve this problem with an asymmetric hashing framework based on discrete inner product fitting. We learn a pair of related hash functions that map heterogeneous objects (e.g., users and items) into a common discrete space where the inner product of their binary codes reveals their true similarity as defined by the original searching measure. The fast ranking problem is thus reduced to an ANN search via this asymmetric hashing scheme. We then propose a sampling strategy to efficiently select relevant and contrastive samples to train the hashing model. We empirically validate the proposed method against existing state-of-the-art fast item ranking methods in several combinations of non-linear searching functions and prominent datasets.
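    The asymmetric part of the scheme is that queries and items get *different* hash functions mapping into the same binary code space, where a cheap inner product stands in for the expensive measure. The sketch below uses random linear projections as stand-ins for the learned hash functions (in the paper these would be trained to fit the neural measure); all names and dimensions here are illustrative:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d, r = 8, 16                        # input dimension, binary code length
    W_q = rng.standard_normal((d, r))   # stand-in for the learned query hash
    W_x = rng.standard_normal((d, r))   # a *different* function for items

    def hash_query(q):
        return np.sign(q @ W_q)         # query code in {-1, +1}^r

    def hash_item(x):
        return np.sign(x @ W_x)         # item code in {-1, +1}^r

    items = rng.standard_normal((100, d))
    codes = hash_item(items)            # item codes precomputed offline
    q = rng.standard_normal(d)
    scores = codes @ hash_query(q)      # fast inner products replace the
                                        # expensive (e.g., neural) measure
    top5 = np.argsort(-scores)[:5]      # approximate top-ranked items
    ```

    Because the item codes are precomputed, ranking at query time costs one hash of the query plus binary inner products, instead of one large-network forward pass per candidate item.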

    Turn Fake into Real: Adversarial Head Turn Attacks Against Deepfake Detection

    Malicious use of deepfakes leads to serious public concern and reduces people's trust in digital media. Although effective deepfake detectors have been proposed, they are substantially vulnerable to adversarial attacks. To evaluate detector robustness, recent studies have explored various attacks. However, all existing attacks are limited to 2D image perturbations, which are hard to translate into real-world facial changes. In this paper, we propose adversarial head turn (AdvHeat), the first attempt at 3D adversarial face views against deepfake detectors, based on face-view synthesis from a single-view fake image. Extensive experiments validate the vulnerability of various detectors to AdvHeat in realistic, black-box scenarios. For example, AdvHeat based on a simple random search yields a high attack success rate of 96.8% with a budget of 360 search steps. When additional query access is allowed, the step budget can be further reduced to 50. Additional analyses demonstrate that AdvHeat outperforms conventional attacks in both cross-detector transferability and robustness to defenses. The adversarial images generated by AdvHeat are also shown to look natural. Our code, including that for generating a multi-view dataset consisting of 360 synthetic views for each of 1000 IDs from FaceForensics++, is available at https://github.com/twowwj/AdvHeaT
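    The black-box random-search attack mentioned above can be sketched as a simple query loop over candidate head-turn angles. This is a toy sketch of the general idea, not the AdvHeat code: `render_view` and `detector_score` are hypothetical hooks standing in for the face-view synthesizer and the deepfake detector, and the angle range and 0.5 threshold are assumptions:

    ```python
    import random

    def random_search_attack(render_view, detector_score, max_steps=360):
        """Black-box random search over head-turn (yaw) angles.
        Succeeds when the detector's fake-score drops below 0.5,
        i.e., the rotated fake face is classified as real."""
        for step in range(1, max_steps + 1):
            angle = random.uniform(-90.0, 90.0)   # candidate yaw angle
            view = render_view(angle)             # synthesize the rotated face
            if detector_score(view) < 0.5:        # query the detector
                return angle, step                # adversarial view found
        return None, max_steps                    # budget exhausted
    ```

    The attack only needs score queries to the detector, which is what makes the black-box setting realistic; smarter query strategies (as the abstract notes) can shrink the step budget further.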